Weak Convergence Properties of Constrained Emphatic Temporal-difference Learning with Constant and Slowly Diminishing Stepsize

Author

Huizhen Yu

Abstract

We consider the emphatic temporal-difference (TD) algorithm, ETD(λ), for learning the value functions of stationary policies in a discounted, finite state and action Markov decision process. The ETD(λ) algorithm was recently proposed by Sutton, Mahmood, and White [47] to solve a long-standing divergence problem of the standard TD algorithm when it is applied to off-policy training, where data from an exploratory policy are used to evaluate other policies of interest. The almost sure convergence of ETD(λ) has been proved in our recent work under general off-policy training conditions, but for a narrow range of diminishing stepsize. In this paper we present convergence results for constrained versions of ETD(λ) with constant stepsize and with diminishing stepsize from a broad range. Our results characterize the asymptotic behavior of the trajectory of iterates produced by those algorithms, and are derived by combining key properties of ETD(λ) with powerful convergence theorems from the weak convergence methods in stochastic approximation theory. For the case of constant stepsize, in addition to analyzing the behavior of the algorithms in the limit as the stepsize parameter approaches zero, we also analyze their behavior for a fixed stepsize and bound the deviations of their averaged iterates from the desired solution. These results are obtained by exploiting the weak Feller property of the Markov chains associated with the algorithms, and by using ergodic theorems for weak Feller Markov chains, in conjunction with the convergence results we get from the weak convergence methods. Besides ETD(λ), our analysis also applies to the off-policy TD(λ) algorithm, when the divergence issue is avoided by setting λ sufficiently large. It yields, for that case, new results on the asymptotic convergence properties of constrained off-policy TD(λ) with constant or slowly diminishing stepsize.
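As a rough illustration of the kind of iteration the abstract refers to, the following is a minimal sketch of one constrained ETD(λ) update with constant stepsize and linear function approximation. It follows the commonly published form of the emphatic TD recursion (follow-on trace, emphasis weight, eligibility trace) and makes several simplifying assumptions that are not taken from this page: constant discount `gamma` and constant `lam`, interest equal to 1 in every state, and a constraint realized as Euclidean projection onto a ball of radius `radius`. The function and parameter names are hypothetical, and the paper's exact formulation may differ.

```python
import math


def project_onto_ball(theta, radius):
    """Euclidean projection of theta onto the ball {x : ||x|| <= radius}."""
    norm = math.sqrt(sum(x * x for x in theta))
    if norm <= radius:
        return list(theta)
    return [x * radius / norm for x in theta]


def constrained_etd_step(theta, trace, follow_on, rho_prev, rho,
                         phi, phi_next, reward,
                         gamma, lam, alpha, radius):
    """One constrained ETD(lambda) update with constant stepsize alpha.

    Assumptions (illustrative only): constant gamma and lam, unit interest,
    projection onto a Euclidean ball. Returns (theta, trace, follow_on).
    """
    # Follow-on trace, with interest fixed at 1 for every state.
    follow_on = gamma * rho_prev * follow_on + 1.0
    # Emphasis weight for the current state.
    emphasis = lam + (1.0 - lam) * follow_on
    # Emphatic eligibility trace, weighted by the importance ratio rho.
    trace = [rho * (gamma * lam * e + emphasis * p)
             for e, p in zip(trace, phi)]
    # Temporal-difference error under linear function approximation.
    v = sum(t * p for t, p in zip(theta, phi))
    v_next = sum(t * p for t, p in zip(theta, phi_next))
    td_error = reward + gamma * v_next - v
    # Unconstrained step, then projection back onto the ball.
    stepped = [t + alpha * td_error * e for t, e in zip(theta, trace)]
    return project_onto_ball(stepped, radius), trace, follow_on
```

In use, the caller carries `theta`, `trace`, and `follow_on` from one transition to the next and supplies the importance ratios `rho_prev` and `rho` (target-policy probability divided by behavior-policy probability) for the previous and current actions. Setting both ratios to 1 recovers the on-policy case, which is a convenient sanity check: on a single-state chain with reward 1 and `gamma = 0.5`, the iterate should settle near the true value 2 when the radius is large enough not to bind.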


Similar resources

Some Simulation Results for Emphatic Temporal-Difference Learning Algorithms

This is a companion note to our recent study of the weak convergence properties of constrained emphatic temporal-difference learning (ETD) algorithms from a theoretic perspective. It supplements the latter analysis with simulation results and illustrates the behavior of some of the ETD algorithms using three example problems.


On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation. The algorithms we analyze include: (i) two basic forms of two-time-scale gradient-based TD algorithms,...


On Convergence of Emphatic Temporal-Difference Learning

We consider emphatic temporal-difference learning algorithms for policy evaluation in discounted Markov decision processes with finite spaces. Such algorithms were recently proposed by Sutton, Mahmood, and White (2015) as an improved solution to the problem of divergence of off-policy temporal-difference learning with linear function approximation. We present in this paper the first convergence...


Ghost penalties in nonconvex constrained optimization: Diminishing stepsizes and iteration complexity

We consider, for the first time, general diminishing stepsize methods for nonconvex, constrained optimization problems. We show that by using directions obtained in an SQP-like fashion convergence to generalized stationary points can be proved. In order to do so, we make use of classical penalty functions in an unconventional way. In particular, penalty functions only enter in the theoretical a...


Approximate Primal Solutions and Rate Analysis for Dual Subgradient Methods

We study primal solutions obtained as a by-product of subgradient methods when solving the Lagrangian dual of a primal convex constrained optimization problem (possibly nonsmooth). The existing literature on the use of subgradient methods for generating primal optimal solutions is limited to the methods producing such solutions only asymptotically (i.e., in the limit as the number of subgradien...



Journal:

Journal of Machine Learning Research

Volume 17

Publication year: 2016
